Effective Document-Level Features for Chinese Patent Word Segmentation

نویسندگان

  • Si Li
  • Nianwen Xue
چکیده

A patent is a property right for an invention granted by the government to the inventor. Patents often have a high concentration of scientific and technical terms that are rare in everyday language. However, some scientific and technical terms usually appear with high frequency only in one specific patent. In this paper, we propose a pragmatic approach to Chinese word segmentation on patents where we train a sequence labeling model based on a group of novel document-level features. Experiments show that the accuracy of our model reached 96.3% (F1 score) on the development set and 95.0% on a held-out test set.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese

Unlike European languages, many Asian languages like Chinese and Japanese do not have typographic boundaries in written system. Word segmentation (tokenization) that break sentences down into individual words (tokens) is normally treated as the first step for machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead different segmentation results in differe...

متن کامل

Enhancing Chinese Word Segmentation Using Unlabeled Data

This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In ad...

متن کامل

Towards Accurate Word Segmentation for Chinese Patents

A patent is a property right for an invention granted by the government to the inventor. An invention is a solution to a specific technological problem. So patents often have a high concentration of scientific and technical terms that are rare in everyday language. The Chinese word segmentation model trained on currently available everyday language data sets performs poorly because it cannot ef...

متن کامل

A MMSM-based Hybrid Method for Chinese MicroBlog Word Segmentation

After years of researches, Chinese word segmentation has achieved quite high precisions for formal style text. However, the performance of segmentation is not so satisfying for MicroBlog corpora. In this paper we describe a scheme for Chinese word segmentation for, MicroBlog which integrates the characterbased and word-based information in the directed graph generated by MMSM model. Word-level ...

متن کامل

Experimental Comparison of Discriminative Learning Approaches for Chinese Word Segmentation

Natural language processing tasks assume that the input is tokenized into individual words. In languages like Chinese, however, such tokens are not available in the written form. This thesis explores the use of machine learning to segment Chinese sentences into word tokens. We conduct a detailed experimental comparison between various methods for word segmentation. We have built two Chinese wor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014